We consider the problem of jointly estimating a collection of
graphical models for discrete data, corresponding to several categories
that share some common structure. An example of such a
setting is voting records of legislators on different issues, such as defense,
energy, and healthcare. We develop a Markov graphical model
to characterize the heterogeneous dependence structures arising from
such data. The model is fitted via a joint estimation method that
preserves the underlying common graph structure, but also allows
for differences between the networks. The method employs a group
penalty that targets the common zero interaction effects across all
the networks. We apply the method to describe the internal networks
of the U.S. Senate on several important issues. Our analysis reveals
individual structure for each issue, distinct from the underlying well-known
bipartisan structure common to all categories which we are
able to extract separately. We also establish consistency of the proposed
method both for parameter estimation and model selection,
and evaluate its numerical performance on a number of simulated
examples.


1. Introduction. The analysis of roll call data of legislative bodies has
attracted a lot of attention both in the political science and statistical literature.
For political scientists, such data allow the study of broad issues such as
party cohesion as well as more specific ones such as coalition formation; see,
for example, the books by Enelow and Hinich (1984), Matthews and Stimson
(1975), Morton (1999), Poole and Rosenthal (1997). A popular tool in political
science is the ideal point model [Clinton, Jackman and Rivers (2004)]
that posits a one-dimensional latent political space along which legislators
and bills they vote for are aligned. A legislator’s position corresponds to
an ideal point, where bills coinciding with that position maximize his/her
utility. These ideal points reveal legislators’ preferences and it is of interest
to infer them from roll call data. An extension of this model that incorporates
information about the text of the bills being voted upon is discussed in
Gerrish (2011), while the impact of absenteeism is examined in Han (2007).

Fig. 1. Multidimensional scaling projection of roll call data of the U.S. Senate for the
period 2005-2006 (Republicans shown in red and Democrats in blue).

A statistical challenge is how to best model and present the roll call
data in a way that makes interesting patterns apparent and facilitates subsequent
analyses. A number of techniques have been employed including
principal components analysis (PCA) [de Leeuw (2006)], multidimensional
scaling (MDS) [Diaconis, Goel and Holmes (2008)], Bayesian spatial voting
models [Clinton, Jackman and Rivers (2004)], and graphical models for
binary data [Banerjee, El Ghaoui and d’Aspremont (2008)].

Dimension reduction techniques such as PCA and MDS aim at constructing
a “map,” with the members of the legislative body positioned relative
to their peers according to their voting pattern. A typical example of such a
map of the U.S. Senate members in the 109th Congress (2005-2006) using
multidimensional scaling for selected votes is shown in Figure 1; for a detailed
description of the data see Section 4. A clear separation between members
of the two parties is seen (Republicans to the left of the map and Democrats
to the right), together with some members exhibiting a voting pattern deviating
from their party, for example, Nelson (Democrat of Nebraska), and
Collins and Snowe (Republicans of Maine), while the independent Jeffords
(shown in purple) votes like a Democrat. More interestingly, the voting patterns
within both parties form distinct subclusters. While the nature of this
division is impossible to infer from an MDS or a PCA representation such
as the one shown in Figure 1, our subsequent analysis will show that this
difference is driven by votes on defense/security and healthcare issues.

This finding suggests that treating all votes as homogeneous, that is, assuming
that they represent the same underlying relationship between senators,
may mask more subtle patterns which depend on the issues being voted
upon. Therefore, treating votes as heterogeneous is more accurate and can
provide further insight into the voting behavior of different groups of senators
on different issues. In this paper, we focus on voting records on three
types of bills: defense and national security, environment and energy, and
healthcare issues. Voting on the latter category is typically more partisan
than voting on defense and national security and, thus, we expect to see
different connections in different categories.

The voting records of the U.S. Senate from the 109th Congress covering
the period 2005-2006 were obtained directly from the Senate’s website
(www.senate.gov). We chose the 109th Congress because its voting patterns 
have been previously analyzed in the literature [see, e.g., Banerjee, El Ghaoui 
and d’Aspremont (2008)], but as we have discovered, the version of the data 
previously analyzed was contaminated with voting records from the 1990s 
(when the set of senators would have been different). Thus, we collected the 
data ourselves, on all the 645 votes that the Senate deliberated and voted 
on during that period, which include bills, resolutions, motions, debates and 
roll call votes. To study the potential heterogeneity in the voting patterns, 
we focused on the three largest meaningful (i.e., excluding purely procedural 
votes) categories of votes extracted from bills, resolutions and motions: (1) 
defense and security issues; (2) environment and energy issues; (3) health 
and medical care issues. The categories were extracted by a combination of 
text analysis of bill names and manual labeling. A complete analysis of this 
data set will be presented in Section 4. 

Our goal in this paper is to develop a statistical model for studying dependence
patterns in such situations: there is some overall structure present
(party affiliation, which affects everything) and there are also distinct categories
with their own individual structures. Since we are dealing with voting
data, we use Markov network models to capture the dependence structure of
binary or categorical random variables. Similar to Gaussian graphical models,
nodes in a Markov network correspond to (categorical) variables, while
edges represent dependence between nodes conditional on all other variables.
Graphical models are an exploratory data analysis tool used in a number
of application areas to explore the dependence structure between variables,
including bioinformatics [Airoldi (2007)], natural language processing [Jung
et al. (1996)] and image analysis [Li (2001)]. In the case of Gaussian graphical
models, which assume the variables are jointly normally distributed, the
structure of the underlying graph can be fully determined from the corresponding
inverse covariance (precision) matrix, the off-diagonal elements of
which are proportional to partial correlations between the variables. A number
of methods have been recently proposed in the literature to fit sparse
Gaussian graphical models [see, e.g., Meinshausen and Bühlmann (2006),
Yuan and Lin (2007), Banerjee, El Ghaoui and d’Aspremont (2008), Rothman
et al. (2008), Ravikumar et al. (2011), Peng, Zhou and Zhu (2009) and
references therein]. Sparse Markov networks for binary data (Ising models)
have been studied by Höfling and Tibshirani (2009), Guo et al. (2009),
Ravikumar, Wainwright and Lafferty (2010), Anandkumar et al. (2012),
Xue, Zou and Cai (2012). These methods do not allow for different categories
within the data.

To allow for heterogeneity, we develop a framework for fitting different
Markov models for each category that are nevertheless linked, sharing nodes
and some common edges across all categories, while other edges are uniquely
associated with a particular category. This will allow us to borrow strength
across categories instead of fitting them completely separately. For the Gaussian
case, this type of joint graphical model was first studied by Guo et al.
(2011), who proposed a joint likelihood based estimation method that borrowed
strength across categories. Several other papers have proposed alternative
algorithms for the Gaussian case [Danaher, Wang and Witten (2011),
Yang et al. (2012), Hara and Washio (2013)]. We note that a context-specific
graphical model was proposed for count data in the form of contingency tables
by Højsgaard (2004), but contingency tables are not suitable for high-dimensional
data and the context-specific model is not sparse.

The advantage of using a Markov graphical model in this context is that it
quantifies the degree of conditional dependence between the senators based
on their voting record, and hence the obtained network is directly interpretable.
Techniques like multidimensional scaling and principal components
analysis represent relative similarities between senators’ voting records on
the map and, hence, the distance between any two senators can be interpreted
as a quantitative measure of similarity between their voting records.
However, unlike in a Markov network, these distances are not interpretable
in the context of a generative probability model.

The remainder of the paper is organized as follows. Section 2 introduces
the Markov network and addresses algorithmic issues, and Section 3 briefly
illustrates the performance of the joint estimation method on simulated
data. A detailed analysis of the U.S. Senate’s voting record from the 109th
Congress is presented in Section 4. Some concluding remarks are drawn in
Section 5, and the Appendix presents results on the asymptotic properties
of the method. The electronic supplementary material contains a detailed
investigation of missing data imputation methods for the Senate vote data.

2. Model and estimation algorithm. In this section we present the Markov
model for heterogeneous data, focusing on the special case of binary variables
(also known as the Ising model). The extension to general categorical
variables is briefly discussed in Section 5. We start by discussing estimation
of separate models for each category and then develop a method for joint
estimation.

The main technical challenge in likelihood-based estimation of Markov
graphical models is computational intractability due to the normalizing
constant. To overcome this difficulty, different methods employing computationally
tractable approximations to the likelihood have been proposed in
the literature; these include methods based on surrogate likelihood [Banerjee,
El Ghaoui and d’Aspremont (2008), Kolar and Xing (2008)] and pseudo-likelihood
[Höfling and Tibshirani (2009), Ravikumar, Wainwright and Lafferty
(2010), Guo et al. (2010)]. Höfling and Tibshirani (2009) also proposed
an iterative algorithm that successively approximates the original likelihood
through a series of pseudo-likelihoods, while Ravikumar, Wainwright and
Lafferty (2010) and Guo et al. (2010) established asymptotic consistency of
their respective methods.

2.1. Problem setup and separate estimation. We start by setting up
notation and reviewing previous work on estimating a single Ising model,
which can be used to estimate the graph for each category separately. Suppose
that data have been collected on $p$ binary variables in $K$ categories, with
$n_k$ observations in the $k$th category, $k = 1, \ldots, K$. Let
$x_i^{(k)} = (x_{i,1}^{(k)}, \ldots, x_{i,p}^{(k)})$
denote a $p$-dimensional row vector containing the data for the $i$th observation
in the $k$th category and assume that it is drawn independently from an
exponential family with the probability mass function

(2.1)  $f_k(X_1, \ldots, X_p) = \frac{1}{Z(\Theta^{(k)})} \exp\Big( \sum_{j=1}^{p} \theta_{j,j}^{(k)} X_j + \sum_{1 \le j < j' \le p} \theta_{j,j'}^{(k)} X_j X_{j'} \Big).$

The partition function
$Z(\Theta^{(k)}) = \sum_{X_j \in \{0,1\},\, 1 \le j \le p} \exp\Big( \sum_{j=1}^{p} \theta_{j,j}^{(k)} X_j + \sum_{1 \le j < j' \le p} \theta_{j,j'}^{(k)} X_j X_{j'} \Big)$
ensures that the probabilities in (2.1) add up to one. The parameters $\theta_{j,j}^{(k)}$,
$1 \le j \le p$, correspond to the main effect for variable $X_j$ in the $k$th category,
and $\theta_{j,j'}^{(k)}$ is the interaction effect between variables $X_j$ and $X_{j'}$, $1 \le j < j' \le p$.
The underlying network associated with the $k$th category is determined by
the symmetric matrix $\Theta^{(k)} = (\theta_{j,j'}^{(k)})_{p \times p}$. Specifically, if $\theta_{j,j'}^{(k)} = 0$, then $X_j$
and $X_{j'}$ are conditionally independent in the $k$th category given all the
remaining variables and, hence, their corresponding nodes are not connected.
For each category, (2.1) is referred to as the Markov network in the machine
learning literature and as the log-linear model in the statistics literature,
where $\theta_{j,j'}^{(k)}$ is also interpreted as the conditional log odds ratio between
$X_j$ and $X_{j'}$ given the other variables. Although general Markov networks
allow higher order interactions (3-way, 4-way, etc.), Ravikumar, Wainwright
and Lafferty (2010) pointed out that in principle one can consider only
the pairwise interaction effects without loss of generality, since higher order
interactions can be converted to pairwise ones by introducing additional
variables [Wainwright and Jordan (2008)]. For the rest of this paper, we only
consider models with pairwise interactions of the original binary variables.
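
For a handful of variables, the pairwise binary model can be evaluated by brute force, which makes the roles of the main effects, the interaction effects and the partition function concrete. The following is a minimal Python sketch and not part of the original analysis; the function names and the toy 3-node parameter matrix are illustrative assumptions.

```python
import itertools
import numpy as np

def ising_unnormalized(x, theta):
    """exp(sum_j theta_jj x_j + sum_{j<j'} theta_jj' x_j x_j') for x in {0,1}^p."""
    x = np.asarray(x, dtype=float)
    main = np.dot(np.diag(theta), x)
    pair = np.sum(np.triu(theta, k=1) * np.outer(x, x))
    return np.exp(main + pair)

def ising_pmf(theta):
    """Enumerate all 2^p configurations and normalize (feasible only for small p)."""
    p = theta.shape[0]
    configs = np.array(list(itertools.product([0, 1], repeat=p)))
    weights = np.array([ising_unnormalized(x, theta) for x in configs])
    return configs, weights / weights.sum()  # weights.sum() is the partition function

# Hypothetical toy parameters: nodes 0 and 1 interact, node 2 is isolated.
theta = np.array([[0.2, 1.0, 0.0],
                  [1.0, -0.3, 0.0],
                  [0.0, 0.0, 0.5]])
configs, probs = ising_pmf(theta)

# Conditional log odds ratio between X_0 and X_1 given X_2 = 0.
index = {tuple(int(v) for v in c): i for i, c in enumerate(configs)}
log_odds_ratio = np.log(
    probs[index[(1, 1, 0)]] * probs[index[(0, 0, 0)]]
    / (probs[index[(1, 0, 0)]] * probs[index[(0, 1, 0)]])
)
```

In this toy example `log_odds_ratio` recovers the interaction parameter between the first two nodes, matching the log odds ratio interpretation given above. The enumeration over all 2^p configurations is exactly what becomes infeasible for p = 100.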

The simplest way to deal with heterogeneous data is to estimate $K$ separate
Markov models, one for each category. If one further assumes sparsity for
the $k$th category, the structure of the underlying graph can be estimated by
regularizing the log-likelihood using an $\ell_1$ penalty:

(2.2)  $\max_{\Theta^{(k)}} \sum_{i=1}^{n_k} \log f_k(x_{i,1}^{(k)}, \ldots, x_{i,p}^{(k)}) - \lambda \sum_{j<j'} |\theta_{j,j'}^{(k)}|.$

The $\ell_1$ penalty shrinks some of the interaction effects $\theta_{j,j'}^{(k)}$ to zero and $\lambda$
controls the degree of sparsity. However, solving (2.2) directly is computationally
infeasible due to the nature of the partition function. A standard
approach in such a situation is to replace the likelihood with a pseudo-likelihood
[Besag (1986)], which has been shown to work well in a range
of situations. Here, we use a pseudo-likelihood estimation method for Ising
models [Höfling and Tibshirani (2009), Guo et al. (2010)], based on

(2.3)  $\max_{\Theta^{(k)}} \sum_{i=1}^{n_k} \sum_{j=1}^{p} \Big[ x_{i,j}^{(k)} \Big( \theta_{j,j}^{(k)} + \sum_{j' \ne j} \theta_{j,j'}^{(k)} x_{i,j'}^{(k)} \Big) - \log\Big\{ 1 + \exp\Big( \theta_{j,j}^{(k)} + \sum_{j' \ne j} \theta_{j,j'}^{(k)} x_{i,j'}^{(k)} \Big) \Big\} \Big] - \lambda \sum_{j<j'} |\theta_{j,j'}^{(k)}|,$

where $\Theta^{(k)}$ is restricted to be symmetric. Criterion (2.3) can be efficiently
maximized using the modified coordinate descent algorithm of Höfling and
Tibshirani (2009).
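
To illustrate why the pseudo-likelihood is tractable, note that it only involves the p node-wise logistic conditionals and never the partition function. The following Python sketch evaluates such a penalized criterion for one category; it is an illustration under the assumption that the criterion is the sum of node-wise conditional log-likelihoods minus an $\ell_1$ penalty on the upper-triangular interactions, and the function names are hypothetical.

```python
import numpy as np

def pseudo_loglik(X, theta):
    """Sum over observations i and nodes j of x_ij * eta_ij - log(1 + exp(eta_ij)),
    where eta_ij = theta_jj + sum_{j' != j} theta_jj' * x_ij' and theta is symmetric."""
    X = np.asarray(X, dtype=float)
    off = theta - np.diag(np.diag(theta))        # interaction effects only
    eta = np.diag(theta)[None, :] + X @ off      # (n, p) matrix of linear predictors
    return np.sum(X * eta - np.logaddexp(0.0, eta))

def penalized_criterion(X, theta, lam):
    """Pseudo-log-likelihood minus lam * sum_{j<j'} |theta_jj'|."""
    penalty = lam * np.sum(np.abs(np.triu(theta, k=1)))
    return pseudo_loglik(X, theta) - penalty
```

Evaluating the criterion costs on the order of n p^2 operations, in contrast to the 2^p terms needed for the exact likelihood. The sketch only evaluates the objective; it does not implement the coordinate descent algorithm cited above.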


2.2. Joint estimation of heterogeneous networks. The separate estimation
methods reviewed in the previous section do not take advantage of the
shared nodes among the categories and potential common structure. Our
goal here is to explicitly include this into the estimation procedure. We
start by reparameterizing each $\theta_{j,j'}^{(k)}$ as

(2.4)  $\theta_{j,j'}^{(k)} = \phi_{j,j'} \gamma_{j,j'}^{(k)}, \quad 1 \le j \ne j' \le p;\ 1 \le k \le K.$

To avoid sign ambiguities between $\phi_{j,j'}$ and $\gamma_{j,j'}^{(k)}$, we restrict $\phi_{j,j'} \ge 0$,
$1 \le j < j' \le p$. To preserve the symmetry of $\Theta^{(k)}$, we also require $\phi_{j,j'} = \phi_{j',j}$
and $\gamma_{j,j'}^{(k)} = \gamma_{j',j}^{(k)}$, for all $1 \le j < j' \le p$ and $1 \le k \le K$. Moreover, for
identifiability reasons, we restrict the diagonal elements $\phi_{j,j} = 1$ and $\gamma_{j,j}^{(k)} = \theta_{j,j}^{(k)}$.
Note that $\phi_{j,j'}$ is a common factor across all $K$ categories that controls the
occurrence of common links shared across categories, while $\gamma_{j,j'}^{(k)}$ is an individual
factor specific to the $k$th category. The proposed joint estimation
method maximizes the following penalized criterion:

(2.5)  $\max_{\Phi, \{\Gamma^{(k)}\}_{k=1}^{K}} \sum_{k=1}^{K} \sum_{i=1}^{n_k} \sum_{j=1}^{p} \Big[ x_{i,j}^{(k)} \Big( \gamma_{j,j}^{(k)} + \sum_{j' \ne j} \phi_{j,j'} \gamma_{j,j'}^{(k)} x_{i,j'}^{(k)} \Big) - \log\Big\{ 1 + \exp\Big( \gamma_{j,j}^{(k)} + \sum_{j' \ne j} \phi_{j,j'} \gamma_{j,j'}^{(k)} x_{i,j'}^{(k)} \Big) \Big\} \Big] - \eta_1 \sum_{j<j'} \phi_{j,j'} - \eta_2 \sum_{j<j'} \sum_{k=1}^{K} |\gamma_{j,j'}^{(k)}|,$

where $\Phi = (\phi_{j,j'})_{p \times p}$ and $\Gamma^{(k)} = (\gamma_{j,j'}^{(k)})_{p \times p}$. The tuning parameter $\eta_1$ controls
sparsity of the common structure across the $K$ networks. Specifically, if
$\phi_{j,j'}$ is shrunk to zero, all $\theta_{j,j'}^{(1)}, \ldots, \theta_{j,j'}^{(K)}$ are also zero and, hence, there is no
link between nodes $j$ and $j'$ in any of the $K$ graphs. Similarly, $\eta_2$ is a tuning
parameter controlling sparsity of links in individual categories. Due to the
nature of the $\ell_1$ penalty, some of the $\gamma_{j,j'}^{(k)}$'s will be shrunk to zero, resulting in
a collection of graphs with individual differences. Note that this two-level
penalty was originally proposed by Zhou and Zhu (2007) for group variable
selection in linear regression.

The criterion (2.5) achieves the stated goal of estimating common structure
and hence borrows strength across the $K$ data categories, but requires
the selection of two tuning parameters. However, there is an equivalent criterion
presented next that only involves a single tuning parameter, thus
simplifying the estimation task:

(2.6)  $\max_{\{\Theta^{(k)}\}_{k=1}^{K}} \sum_{k=1}^{K} \sum_{i=1}^{n_k} \sum_{j=1}^{p} \Big[ x_{i,j}^{(k)} \Big( \theta_{j,j}^{(k)} + \sum_{j' \ne j} \theta_{j,j'}^{(k)} x_{i,j'}^{(k)} \Big) - \log\Big\{ 1 + \exp\Big( \theta_{j,j}^{(k)} + \sum_{j' \ne j} \theta_{j,j'}^{(k)} x_{i,j'}^{(k)} \Big) \Big\} \Big] - \lambda \sum_{1 \le j < j' \le p} \sqrt{ \sum_{k=1}^{K} |\theta_{j,j'}^{(k)}| },$

where $\lambda = 2\sqrt{\eta_1 \eta_2}$. The optimization problems given by (2.5) and (2.6) are
equivalent in the sense that for each pair of $(\eta_1, \eta_2)$ there is a $\lambda$ that gives
the same solution and vice versa. Their equivalence can be formalized as
follows (here $A \cdot B$ denotes the Schur-Hadamard element-wise product of
two matrices):


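Proposition 1. Let $\{\hat\Theta^{(k)}\}_{k=1}^{K}$ be a local maximizer of (2.6). Then there
exists a local maximizer of (2.5), $(\hat\Phi, \{\hat\Gamma^{(k)}\}_{k=1}^{K})$, such that $\hat\Theta^{(k)} = \hat\Phi \cdot \hat\Gamma^{(k)}$,
for all $1 \le k \le K$. On the other hand, if $(\hat\Phi, \{\hat\Gamma^{(k)}\}_{k=1}^{K})$ is a local maximizer
of (2.5), then there also exists a local maximizer of (2.6), $\{\hat\Theta^{(k)}\}_{k=1}^{K}$, such
that $\hat\Theta^{(k)} = \hat\Phi \cdot \hat\Gamma^{(k)}$, for all $1 \le k \le K$.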


The proof of this proposition is similar to the proofs of Lemma 1 and
Theorem 1 in Zhou and Zhu (2007) and is omitted here. Note that even
though choosing a single tuning parameter $\lambda$ corresponds to a particular
path in the $(\eta_1, \eta_2)$ space, this restriction affects only the individual estimates
$\phi_{j,j'}$ and $\gamma_{j,j'}^{(k)}$, but not their product $\theta_{j,j'}^{(k)}$.
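
The relation between the two-level penalty and the single group penalty can be checked numerically for one pair of nodes. For fixed interaction values across categories, minimizing the two-level penalty over the common factor should give 2 times the square root of (eta_1 eta_2) times the square root of the sum of absolute interactions. The Python sketch below performs this check on a grid; the parameter values are arbitrary illustrative choices and the function names are hypothetical.

```python
import numpy as np

def two_level_penalty(theta_group, eta1, eta2, phi):
    """eta1 * phi + eta2 * sum_k |gamma_k| with gamma_k = theta_k / phi, phi > 0."""
    gamma = np.asarray(theta_group) / phi
    return eta1 * phi + eta2 * np.sum(np.abs(gamma))

def group_penalty(theta_group, lam):
    """lam * sqrt(sum_k |theta_k|), the single-parameter penalty."""
    return lam * np.sqrt(np.sum(np.abs(theta_group)))

eta1, eta2 = 0.7, 0.3                       # arbitrary illustrative values
theta_group = np.array([0.8, -0.5, 0.0])    # one edge across K = 3 categories
lam = 2.0 * np.sqrt(eta1 * eta2)

# Minimize the two-level penalty over the common factor phi on a fine grid.
phis = np.linspace(1e-3, 5.0, 200001)
values = eta1 * phis + eta2 * np.sum(np.abs(theta_group)) / phis
best = values.min()
gap = abs(best - group_penalty(theta_group, lam))
```

Here `gap` is numerically negligible, which is the one-edge version of the equivalence stated in the proposition: profiling out the common factor turns the two penalties with parameters eta_1 and eta_2 into a single square-root group penalty.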


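2.3. Algorithm and model selection. Criterion (2.6) leads to an efficient
estimation algorithm based on the local linear approximation. Specifically,
letting $(\hat\theta_{j,j'}^{(k)})^{[t]}$ denote the estimates from the $t$th iteration, we approximate

$\sqrt{ \sum_{k=1}^{K} |\theta_{j,j'}^{(k)}| } \approx \sum_{k=1}^{K} |\theta_{j,j'}^{(k)}| \Big/ \sqrt{ \sum_{k=1}^{K} |(\hat\theta_{j,j'}^{(k)})^{[t]}| }.$

Thus, at the $(t+1)$th iteration, problem (2.6) is decomposed into $K$ individual
optimization problems:

(2.7)  $\max_{\Theta^{(k)}} \sum_{i=1}^{n_k} \sum_{j=1}^{p} \Big[ x_{i,j}^{(k)} \Big( \theta_{j,j}^{(k)} + \sum_{j' \ne j} \theta_{j,j'}^{(k)} x_{i,j'}^{(k)} \Big) - \log\Big\{ 1 + \exp\Big( \theta_{j,j}^{(k)} + \sum_{j' \ne j} \theta_{j,j'}^{(k)} x_{i,j'}^{(k)} \Big) \Big\} \Big] - \lambda \sum_{1 \le j < j' \le p} \tau_{j,j'}^{[t]} |\theta_{j,j'}^{(k)}|,$

where $\tau_{j,j'}^{[t]} = \big( \sum_{k=1}^{K} |(\hat\theta_{j,j'}^{(k)})^{[t]}| \big)^{-1/2}$.
Note that criterion (2.7) is a variant of criterion (2.3) with a weighted $\ell_1$
penalty and hence can be solved by the algorithm of Höfling and Tibshirani
(2009). For numerical stability, we threshold $\sqrt{ \sum_{k=1}^{K} |(\hat\theta_{j,j'}^{(k)})^{[t]}| }$ at $10^{-10}$. The
algorithm is summarized as follows:

Step 1. Initialize the $\hat\theta_{j,j'}^{(k)}$'s ($1 \le j, j' \le p$; $1 \le k \le K$) using the estimates from
the separate estimation method;

Step 2. For each $1 \le k \le K$, update the $\hat\theta_{j,j'}^{(k)}$'s by solving (2.7) using the
pseudo-likelihood algorithm [Höfling and Tibshirani (2009), Guo et al. (2010)];

Step 3. Repeat step 2 until convergence.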


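The tuning parameter $\lambda$ in (2.6) controls the sparsity of the resulting
estimator and can be selected using cross-validation. Specifically, for each
$1 \le k \le K$, we randomly split the data in the $k$th category into $D$ subsets of
similar sizes and denote the index set of the observations in the $d$th subset
as $T_d^{(k)}$, $1 \le d \le D$. Then $\lambda$ is selected by maximizing

$\sum_{d=1}^{D} \sum_{k=1}^{K} \frac{1}{|T_d^{(k)}|} \sum_{i \in T_d^{(k)}} \sum_{j=1}^{p} \Big[ x_{i,j}^{(k)} \Big( \hat\theta_{j,j}^{(k)[-d]}(\lambda) + \sum_{j' \ne j} \hat\theta_{j,j'}^{(k)[-d]}(\lambda) x_{i,j'}^{(k)} \Big) - \log\Big\{ 1 + \exp\Big( \hat\theta_{j,j}^{(k)[-d]}(\lambda) + \sum_{j' \ne j} \hat\theta_{j,j'}^{(k)[-d]}(\lambda) x_{i,j'}^{(k)} \Big) \Big\} \Big],$

where $|T_d^{(k)}|$ is the cardinality of $T_d^{(k)}$ and $\hat\theta_{j,j'}^{(k)[-d]}(\lambda)$ is the joint estimate
of $\theta_{j,j'}^{(k)}$ based on all observations except those in $T_d^{(1)} \cup \cdots \cup T_d^{(K)}$, as well as
the tuning parameter $\lambda$.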
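
A cross-validation loop of this kind can be sketched in Python as follows. The sketch assumes the criterion is the held-out pseudo-log-likelihood averaged within each held-out subset; `fit` is a hypothetical callable that takes the list of training matrices and a tuning parameter and returns one estimated parameter matrix per category.

```python
import numpy as np

def make_folds(n_obs, n_folds, rng):
    """Randomly split indices 0..n_obs-1 into n_folds subsets of similar sizes."""
    return np.array_split(rng.permutation(n_obs), n_folds)

def heldout_score(X, theta):
    """Pseudo-log-likelihood of the held-out rows, divided by their number."""
    off = theta - np.diag(np.diag(theta))
    eta = np.diag(theta) + X @ off
    return np.sum(X * eta - np.logaddexp(0.0, eta)) / X.shape[0]

def cv_criterion(Xs, lam, fit, n_folds=5, seed=0):
    """Sum of held-out scores over folds and categories for one value of lam."""
    rng = np.random.default_rng(seed)          # fixed seed: same folds for every lam
    folds = [make_folds(X.shape[0], n_folds, rng) for X in Xs]
    total = 0.0
    for d in range(n_folds):
        train = [np.delete(X, f[d], axis=0) for X, f in zip(Xs, folds)]
        thetas = fit(train, lam)               # joint estimate without the d-th subsets
        total += sum(heldout_score(X[f[d]], th)
                     for X, f, th in zip(Xs, folds, thetas))
    return total
```

In use, one would evaluate `cv_criterion` over a grid of tuning parameter values and keep the maximizer. Fixing the seed keeps the splits identical across values, so that differences in the criterion reflect the tuning parameter and not the random partition.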


3. Simulation study. Before turning our attention to examining the U.S.
Senate voting patterns, we evaluate the performance of the joint estimation
method on three synthetic examples, each with p = 100 variables and K = 3
categories. The network structure in each example is composed of two parts:
the common structure across all categories and the individual structure specific
to a category. The common structures in these examples are a chain
graph, a nearest neighbor graph and a scale-free graph. These graphs are
generated as follows:


Example 1: Chain graph. A chain graph is generated by connecting nodes
1 to p in increasing order, as shown in Figure 2(A1).

Example 2: Nearest neighbor graph. The data generating mechanism of 
the nearest neighbor graph is adapted from Li and Gui (2006). Specifically, 
we generate p points randomly on a unit square, calculate all p(p — l)/2 
pairwise distances, and find three nearest neighbors of each point in terms 
of these distances. The nearest neighbor network is obtained by linking any 
two points that are nearest neighbors of each other. Figure 2(B1) illustrates 
a nearest-neighbor graph. 
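
The nearest neighbor construction can be sketched in Python as below. This is an illustration under assumptions, not the generator used in the study: "nearest neighbors of each other" is read as a mutual relation (each point is among the other's three nearest), and the function name and seed are hypothetical.

```python
import numpy as np

def nearest_neighbor_graph(p, m=3, seed=0):
    """Adjacency matrix linking points that are among each other's m nearest neighbors."""
    rng = np.random.default_rng(seed)
    pts = rng.random((p, 2))                              # p points on the unit square
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                        # a point is not its own neighbor
    nn = np.argsort(dist, axis=1)[:, :m]                  # indices of the m nearest points
    is_nn = np.zeros((p, p), dtype=bool)
    is_nn[np.repeat(np.arange(p), m), nn.ravel()] = True
    return is_nn & is_nn.T                                # keep mutual neighbors only

adjacency = nearest_neighbor_graph(100)
```

Under the mutual reading, every node has at most three links, so the resulting graph is sparse by construction.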

Example 3: Scale-free graph. A scale-free graph has a power-law degree
distribution and can be simulated by the Barabási-Albert algorithm
[Barabási and Albert (1999)]. A realization of a scale-free network is depicted
in Figure 2(C1).


In each example, the network for the $k$th category ($k = 1, \ldots, K$) is created
by randomly adding links to the common structure. The individual links
in different categories are disjoint and have the same degree of sparsity,
measured by $\rho$, the ratio of the number of individual links to the number
of common links. In particular, $\rho = 0$ corresponds to identical networks for
all three categories. In the simulation study, we consider $\rho = 0$, 1/4 and 1,
gradually increasing the proportion of individual links (Figure 2). Given the
graphs, the symmetric parameter matrix $\Theta^{(k)}$ is generated as follows. Each
$\theta_{j,j'}^{(k)} = \theta_{j',j}^{(k)}$ corresponding to an edge between nodes $j$ and $j'$ is uniformly
drawn from $[-1, -0.5] \cup [0.5, 1]$, whereas all other elements are set to zero.
Then we generate the data using Gibbs sampling. Specifically, suppose the
$t$th iteration sample has been drawn and is denoted as $(x_1^{(k)})^{[t]}, \ldots, (x_p^{(k)})^{[t]}$;
then, in the $(t+1)$th iteration, we draw $(x_j^{(k)})^{[t+1]}$, $1 \le j \le p$, from the
Bernoulli distribution:

(3.1)  $(x_j^{(k)})^{[t+1]} \sim \mathrm{Bernoulli}\Big( \frac{ \exp\big( \theta_{j,j}^{(k)} + \sum_{j' \ne j} \theta_{j,j'}^{(k)} (x_{j'}^{(k)})^{[t]} \big) }{ 1 + \exp\big( \theta_{j,j}^{(k)} + \sum_{j' \ne j} \theta_{j,j'}^{(k)} (x_{j'}^{(k)})^{[t]} \big) } \Big).$

To ensure that the simulated observations are close to i.i.d. samples from the
target distribution, the first 1,000,000 rounds are discarded (burn-in) and the
data are collected every 100 iterations from the sampler. In the simulation
study, we consider a balanced scenario and an unbalanced scenario. The
former consists of $n_k = 300$ observations in each category, whereas the latter
has three unbalanced categories with sample sizes $n_1 = 200$, $n_2 = 300$ and
$n_3 = 400$.
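
A Gibbs sampler of this kind can be sketched in Python as follows. The sketch is illustrative only: it updates the coordinates one at a time using the most recent values of the other coordinates (the usual Gibbs convention, which is an assumption about the sampler used here), and its default burn-in is far shorter than the 1,000,000 rounds used in the study. The function name is hypothetical.

```python
import numpy as np

def gibbs_ising(theta, n_samples, burn_in=1000, thin=100, seed=0):
    """Draw binary samples from a pairwise Ising model with parameter matrix theta."""
    rng = np.random.default_rng(seed)
    p = theta.shape[0]
    x = rng.integers(0, 2, size=p)                 # arbitrary starting configuration
    samples = np.empty((n_samples, p), dtype=int)
    kept = 0
    for sweep in range(1, burn_in + n_samples * thin + 1):
        for j in range(p):
            # theta_jj + sum_{j' != j} theta_jj' * x_j'
            eta = theta[j, j] + theta[j] @ x - theta[j, j] * x[j]
            x[j] = rng.random() < 1.0 / (1.0 + np.exp(-eta))
        if sweep > burn_in and (sweep - burn_in) % thin == 0:
            samples[kept] = x                      # keep one configuration per `thin` sweeps
            kept += 1
    return samples
```

Discarding the burn-in reduces dependence on the starting configuration, and thinning reduces the serial correlation between retained samples, which is why the retained draws can be treated as approximately i.i.d.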

[Figure 2 panels: (A1)-(A3) Chain, (B1)-(B3) Nearest-neighbor, (C1)-(C3) Scale-free, for $\rho$ = 0, 1/4 and 1, respectively.]

Fig. 2. The networks used in three simulated examples. The black lines represent the
common structure, whereas the red, blue and green lines represent the individual links
in the three categories. $\rho$ is the ratio of the number of individual links to the number of
common links.

We compared the structure estimation results of the joint estimation
method and the separate estimation method using ROC curves, which dynamically
characterize the sensitivity (proportion of correctly identified links)
and the specificity (proportion of correctly excluded links) by varying the
tuning parameter $\lambda$. Figure 3 shows the ROC curves averaged over 10 replications
from the three examples in the balanced scenario, where the joint
estimation method dominates separate estimation when the proportion of
individual links is low. As $\rho$ increases, the structures become more different,
and the joint and separate methods move closer together. This is expected,
since the joint estimation method is designed to take advantage of common
structure. The results in the unbalanced scenario exhibit a similar pattern
(Figure 4).

[Figure 3 panels: ROC curves for the chain, nearest-neighbor and scale-free examples at $\rho$ = 0, 1/4 and 1; horizontal axis: 1 - specificity.]

Fig. 3. Results for the balanced scenario ($n_1 = n_2 = n_3 = 300$) and dimension $p = 100$.
Black solid curve: joint estimation; red dashed curve: separate estimation. The ROC curves
are averaged over 10 replications. $\rho$ is the ratio between the number of individual links and
the number of common links.
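
Each point on such an ROC curve comes from comparing one estimated graph with the true graph. A minimal Python sketch of that comparison is given below; the function name is hypothetical and an edge is taken to be present whenever the estimated interaction is nonzero.

```python
import numpy as np

def sensitivity_specificity(theta_hat, adj_true):
    """Compare estimated nonzero interactions with the true edges, over pairs j < j'."""
    iu = np.triu_indices(adj_true.shape[0], k=1)
    est = theta_hat[iu] != 0
    tru = adj_true[iu].astype(bool)
    sensitivity = (est & tru).sum() / tru.sum()        # correctly identified links
    specificity = (~est & ~tru).sum() / (~tru).sum()   # correctly excluded links
    return sensitivity, specificity

# Toy example with 4 nodes: true edges (0,1) and (1,2); estimated edges (0,1) and (2,3).
adj_true = np.zeros((4, 4), dtype=int)
adj_true[0, 1] = adj_true[1, 0] = adj_true[1, 2] = adj_true[2, 1] = 1
theta_hat = np.zeros((4, 4))
theta_hat[0, 1] = theta_hat[1, 0] = 0.7
theta_hat[2, 3] = theta_hat[3, 2] = -0.4
sens, spec = sensitivity_specificity(theta_hat, adj_true)
```

In the toy example one of the two true links is recovered and three of the four absent links are correctly excluded. Repeating the computation over a grid of tuning parameter values and plotting sensitivity against 1 - specificity traces out the curve.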


[Figure 4 panels: ROC curves for the chain, nearest-neighbor and scale-free examples at $\rho$ = 0, 1/4 and 1; horizontal axis: 1 - specificity.]

Fig. 4. Results for the unbalanced scenario ($n_1 = 200$, $n_2 = 300$, $n_3 = 400$) and dimension
$p = 100$. Black solid curve: joint estimation; red dashed curve: separate estimation. The
ROC curves are averaged over 10 replications. $\rho$ is the ratio between the number of individual
links and the number of common links.

4. Analysis of the U.S. Senate voting records. We applied the proposed
joint estimation method to the voting records of the U.S. Senate from the
109th Congress covering the period 2005-2006. The p = 100 variables correspond
to the senators. The Senate held 645 votes in that period, from which
we extracted n = 222 votes in the three largest categories, namely, defense
and security (141), environment and energy (34), and healthcare (47). The
votes are recorded as “yes” (encoded as “1”) and “no” (encoded as “0”).
The assumption of our model is that bills within a category are an i.i.d.
sample from the same underlying Ising model. In reality, the voting process
may be more complex, with possible temporal factors and further dependencies
among bills, possibly reflecting backroom deals. Nevertheless, this is
an improvement on previous analyses of such data, which treated all bills in
all categories as i.i.d. [Banerjee, El Ghaoui and d’Aspremont (2008)], and is
a reasonable trade-off for an exploratory data analysis tool.

There were missing observations, as not all senators vote on all bills. 
The number of bills containing at least one missing vote was 98 out of 
141 for defense and security, missing a total of 2.26% of all votes; 24 out 
of 34 for environment and energy, missing a total of 3.23% of votes; and 
20 out of 47 for healthcare, missing 2.38% of all votes. While the number 
of bills that are missing at least one Senator’s vote is relatively high, the 
overall proportion of missing observations is quite low and, thus, we do 
not expect it to create a major problem in the analysis. Nevertheless, we 
have investigated multiple strategies for imputing the missing data in the 
electronic supplement; specifically, we considered replacing the missing vote
by the party’s majority, by the majority vote of the five most similar Senators 
and, to test robustness to the imputation method, also by the opposite 
party’s majority and at random. We found that the main conclusions of 
the analysis are not very sensitive to missing data imputation methods. In 
the subsequent analysis, we replace a missing vote for a Senator by his/her 
party’s majority vote on the bill; for the Independent Senator Jeffords, we 
take the Democratic majority vote. After the imputation, the bills with a 
“yes/no” proportion greater than 90% or less than 10% were excluded from 
the analysis, as these typically correspond to procedural votes. This left 97, 
29 and 40 bills in the three categories, respectively. Given that two of the 
sample sizes are fairly small (29 and 40), we added an $\ell_2$ penalty with a
small tuning parameter $\lambda_2 = 0.01$. This approach, known as the elastic net,
has been shown to help avoid extremely sparse networks in such situations 
[Zou and Hastie (2005)]. 
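
The preprocessing just described (party-majority imputation followed by removal of lopsided votes) can be sketched in Python as follows. The data layout, the function names and the tie-breaking rule are assumptions made for illustration; in particular, the Independent senator would need to be labeled with the Democratic group before calling the imputation function.

```python
import numpy as np

def impute_party_majority(votes, party):
    """votes: (n_bills, n_senators) array with entries 1, 0 or np.nan (missing).
    party: length-n_senators array of group labels used for imputation."""
    party = np.asarray(party)
    filled = votes.copy()
    for label in np.unique(party):
        cols = np.where(party == label)[0]
        block = votes[:, cols]
        yes_share = np.nanmean(block, axis=1)             # share of "yes" among votes cast
        majority = (yes_share >= 0.5).astype(float)       # ties go to "yes" (assumption)
        filled[:, cols] = np.where(np.isnan(block), majority[:, None], block)
    return filled

def drop_lopsided(votes, low=0.10, high=0.90):
    """Remove bills whose "yes" proportion is below `low` or above `high`."""
    yes_share = votes.mean(axis=1)
    return votes[(yes_share >= low) & (yes_share <= high)]

# Toy example: 2 bills, 6 senators, one missing vote on each bill.
votes = np.array([[1, 1, np.nan, 0, 0, 0],
                  [1, 1, 1, 1, 1, np.nan]])
party = np.array(["R", "R", "R", "D", "D", "D"])
clean = drop_lopsided(impute_party_majority(votes, party))
```

In the toy example the second bill becomes unanimous after imputation and is dropped, leaving a single bill. The order of the two steps follows the text: the proportion filter is applied after imputation.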

The main tuning parameter for our method was selected through cross-validation.
Following Li and Gui (2006), we used a bootstrap procedure for
final edge selection, estimating the network for 100 bootstrap samples of the
same size, and only retained edges that appeared in more than a fraction $\alpha$ of
the bootstrap samples. This procedure is similar to stability selection [Meinshausen and
Bühlmann (2010)].
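
The bootstrap edge selection step can be sketched in Python as below. For brevity the sketch handles a single category; `fit` is a hypothetical callable returning an estimated parameter matrix for a resampled data set, and the default cutoff of 0.4 mirrors the inclusion value used in the analysis.

```python
import numpy as np

def bootstrap_edge_frequency(X, fit, n_boot=100, seed=0):
    """Fraction of bootstrap samples in which each edge is estimated as nonzero."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros((p, p))
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample rows with replacement, same size
        counts += fit(X[idx]) != 0
    freq = counts / n_boot
    np.fill_diagonal(freq, 0.0)                # main effects are not edges
    return freq

def select_edges(freq, alpha=0.4):
    """Retain edges appearing in more than a fraction alpha of the bootstrap samples."""
    return freq > alpha
```

Any estimator can be plugged in as `fit`; in the joint setting one would resample within each category and refit all categories together on each bootstrap replicate.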

The network representation, depicting both the common and the individual
structures with a cutoff value for inclusion $\alpha = 0.4$ and a value of
$\lambda = 0.05$, is shown in Figure 5. Note that unlike techniques such as principal
components analysis and multidimensional scaling that directly embed
the senators in a two-dimensional map, the proposed method estimates the
edges and constructs the adjacency matrix of the graph of Senators; subsequently,
we employed a graph drawing program to visualize this graph.
The common network structure estimated by the joint estimation method
is shown in the top left panel of Figure 5. For the individual categories,
we only plot the edges associated with the category that are not part of the
common network, to enhance the readability of the graphs. As expected,
members of the two political parties are clearly separated. For both tuning
parameter values, there are strong positive associations between senators
of the same party and selected strong negative associations between senators
of opposite parties. Obviously, at the higher tuning parameter value
the common dependence structure becomes sparser. Of particular interest
is the finding that at both tuning values there are many more associations
between Democratic senators than Republican ones and this pattern holds
for both the common and individual structures. One possible explanation
may be that during that period the Democrats were in the minority and
thus voting more frequently as a block. Further, the Independent Senator
Jeffords is associated with the Democrats, while the moderate Republicans
Collins, Snowe, Chafee and Specter (who switched to the Democratic party
in early 2009) are not strongly associated with their Republican colleagues,
thus confirming results of previous analyses by Clinton, Jackman and Rivers
(2004) and de Leeuw (2006) (albeit based on data from the 105th Congress).
The conservative Democrat Nelson (Nebraska) is also not closely associated
with his party, nor is the very conservative Republican DeMint (South
Carolina) with his. Also, the analysis suggests that Senator Lieberman had a solid
Democratic voting record before becoming an Independent in 2008.

[Figure 5 panels: Common Structure; Defense and Security; Environment and Energy; Health and Medical Issues.]

Fig. 5. The estimated graphical models for the three categories in the Senate voting data
with an inclusion cutoff value of 0.4 and tuning parameter value of 0.5. Edges common to
all three categories are shown under the heading “common structure”; all other edges are
shown on category-specific graphs. The nodes represent the 100 senators, with red, blue
and purple node colors corresponding to Republican, Democrat or Independent (Senator
Jeffords), respectively. A solid line corresponds to a positive interaction effect and a dashed
line to a negative interaction effect. The width of a link is proportional to the magnitude
of the corresponding overall interaction effect.

Other interesting patterns emerging from the analysis are that the more
moderate members of the two parties are located closer to the center of their
respective “clouds” (e.g., Warner, Frist, Voinovich and Smith on the Republican
side, and Levin, Reid, Mikulski and Rockefeller on the Democratic
side), the cluster of economic conservatives on the Republican side (McConnell,
Domenici, Crapo, Inhofe), the close ties of the liberal Democrats
Kennedy, Boxer and Nelson (Florida), the close voting records of senators
from the same state (Schumer and Clinton from New York, Murkowski and
Stevens from Alaska, Snowe and Collins from Maine, Cantwell and Murray
from Washington). There is also a strong dependence between Durbin,
Corzine, Lincoln, Harkin and Dodd on the Democratic side.

Examining the individual networks for the three categories shown in
Figure 5, we note that additional positive associations among Democrats
emerge, primarily for defense and healthcare categories, thus indicating a
stronger ideological cohesion on these issues. Further, a number of stable
negative associations emerge in the environment and healthcare categories,
indicating a stronger ideological divide between senators.

On defense, some additional strong ties emerge between more liberal-leaning
Democrats (Stabenow, Biden, Leahy, Kerry, Boxer), while a strong cluster
on environmental issues arises between Republican senators from
energy-producing states (Murkowski and Stevens from Alaska, Thune from South
Dakota, Hutchison from Texas, but also Bond from Missouri, Chambliss
from Georgia, Craig from Idaho and Roberts from Kansas with their unwavering
support for offshore drilling). On health and medical issues, a number
of additional strong positive associations emerge among Democratic senators,
possibly reflecting the fact that the 109th Congress dealt with issues
ranging from veterans affairs to medical malpractice to food safety and,
especially, legislation on health savings accounts intended to reduce
medical insurance costs.

Different imputation strategies for missing data were also examined and
the analysis results are given in Figures 1-3 in the Supplement for the same
values of the cutoff a and tuning parameter $\lambda$. It can be seen that similar
patterns emerge, although alternative methods of imputation may lead to
the emergence of a few more associations. Nevertheless, the main findings
seem to be robust to the examined choices of the imputation mechanism,
although at very high levels of absenteeism this may not hold [Han (2007)].

For comparison purposes, separate multidimensional scaling analyses are 
shown in Figure 6 for all the votes together and for the three categories 
separately. MDS (or PCA or factor analysis) is one of the commonly taken 
approaches in social sciences when graphical modeling is not considered. 
Figure 6 suggests that the overall vote clustering in the two parties is driven 
to a large extent by the corresponding clustering in the defense and health 
categories. On the other hand, voting on environmental issues creates a 
clear separation between the two parties, although the moderate Republicans 
Chafee, Collins and Snowe are shown to have a voting record similar to the 
Democrats, while the Democrats Nelson (Nebraska) and Landrieu are closer 
to the Republicans. At a high level, MDS-based findings are similar to ours, 
which is a satisfactory result, but they do not provide explicit clusters or 
edges, nor do they provide a way to quantify the amount of dependence 
between individual pairs (visualized via edge thickness in Figure 5). 

Another relevant comparison is to fitting a separate graphical model to
each of the three categories, as could have been done with any of the previously
developed methods for fitting the Ising model. The results are shown
in Figure 7, in the same format as in Figure 5, with edges common to all
three categories shown under “common structure,” and all other edges under
their own category. We followed the same tuning procedure as we did
for joint estimation, bootstrapping the data 100 times for stability selection
and selecting the value of the tuning parameter on a validation data set.

[Figure 6: multidimensional scaling panels (all votes; Defense and Security; Environment and Energy; Health and Medical Issues).]

Fig. 6. Multidimensional scaling analysis for all the votes together, and the three individual
categories. The nodes represent the 100 senators, with red, blue and purple node colors
corresponding to Republican, Democrat or Independent (Senator Jeffords), respectively.

Even with the cutoff set at 1 (we included only the edges appearing in all
the bootstrap replications), the graphs are dense and difficult to interpret.
Similar to MDS, they capture party cohesion through strong positive associations
between members of the same party for all three categories and
some negative associations between members of opposite parties. However,
different voting patterns between categories are not clear, although the results
suggest a more cohesive voting record for both parties for the defense
category. Note that since this is exploratory data analysis, it is hard to verify
which set of results is “better.” Nevertheless, those obtained from the
joint estimation method are more nuanced and interpretable and therefore
provide better insights into voting strategies of members of Congress.

[Figure 7: network panels (Common Structure; Defense and Security; Environment and Energy; Health and Medical Issues).]

Fig. 7. The estimated graphical models for the three categories in the Senate voting data
fitted via separate estimation. Edges common to all three categories are shown under the
heading “common structure”; all other edges are shown on category-specific graphs. The
cutoff value is 1 (only edges appearing in all bootstrap replications are included). The nodes
represent the 100 senators, with red, blue and purple node colors corresponding to Republican,
Democrat or Independent (Senator Jeffords), respectively. A solid line corresponds
to a positive interaction effect and a dashed line to a negative interaction effect. The width
of a link is proportional to the magnitude of the corresponding overall interaction effect.

5. Concluding remarks. We have proposed a joint estimation method
for the analysis of heterogeneous Markov networks, motivated by the need to
jointly estimate related networks such as those of the Senate voting
patterns. The method improves estimation of the networks’ common
structure by borrowing strength across categories, and allows for individual
differences. Asymptotic properties of the method have been established. In
particular, we show that the convergence rate is similar to the rate for Gaussian
graphical models in a similar context [Guo et al. (2010)]. The proposed
method can be extended to deal with general categorical data with more
than two levels using the strategy described in Ravikumar, Wainwright and
Lafferty (2010) and Guo et al. (2010). The most interesting feature emerging
from the analysis of the Senate voting records is the existence of more stable
associations for the Democrats, both in terms of the common structure and
in the healthcare and defense categories.

There are other techniques suitable for analyzing roll call data. Dimension
reduction techniques create maps, where the relative positioning of the senators
allows one to infer similarity in their voting patterns. They provide a
useful visual tool to capture broad patterns and relationships. On the other
hand, a Markov network model aims directly at estimating the associations
between the senators and thus provides an alternative view of the voting
patterns, which together with the thresholding technique employed gives a
measure of the stability of such associations. Further, the joint estimation
method allows one to separately study the overall voting patterns and those
driven by specific issues. In our view, both sets of techniques are useful, with
dimension reduction providing a global perspective and the Markov model
revealing more nuanced patterns.


APPENDIX: ASYMPTOTIC PROPERTIES


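In this section we study the asymptotic properties of the proposed joint
estimation method. Since the structure of the underlying network only depends
on the interaction effects, we focus on a variant of the model without
main effects. Specifically, we solve

$$
\max_{\{\Theta^{(k)}\}_{k=1}^K} \sum_{k=1}^K \frac{1}{n} \sum_{i=1}^{n_k} \sum_{j=1}^p
\biggl[ x_{ij}^{(k)} \sum_{j' \neq j} \theta_{jj'}^{(k)} x_{ij'}^{(k)}
- \log\biggl\{ 1 + \exp\biggl( \sum_{j' \neq j} \theta_{jj'}^{(k)} x_{ij'}^{(k)} \biggr) \biggr\} \biggr]
- \lambda \sum_{j < j'} \biggl( \sum_{k=1}^K |\theta_{jj'}^{(k)}| \biggr)^{1/2}.
\tag{A.1}
$$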


We will show that the estimator defined by criterion (A.1) is consistent in terms of
both parameter estimation and model selection, when $p$ and $n$ go to infinity
and the tuning parameter $\lambda$ goes to zero at some appropriate rate. We note
that our results are pointwise rather than uniform in $\theta$, as is standard
in the literature. Some interesting implications of nonuniform bounds for
sparse estimators in linear regression have recently been discussed by Leeb
and Pötscher (2008), Pötscher and Leeb (2009), although their conclusions
do not apply to graphical models.
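
To make the criterion concrete, the sketch below evaluates a penalized pseudo-log-likelihood of the form (A.1) for given parameter matrices. It is an illustration under stated assumptions rather than the authors' implementation: the function name `joint_objective` is hypothetical, the data are assumed to be coded 0/1, each parameter matrix is assumed symmetric with zero diagonal (no main effects), and the log-likelihood is assumed to be scaled by the total sample size $n$.

```python
import numpy as np

def joint_objective(thetas, data, lam):
    """Penalized pseudo-log-likelihood for K categories (sketch of (A.1)).

    thetas: list of K symmetric (p, p) arrays with zero diagonal (assumed).
    data:   list of K arrays of shape (n_k, p) with 0/1 entries (assumed).
    lam:    tuning parameter lambda.
    """
    n = sum(x.shape[0] for x in data)   # total sample size (assumed scaling)
    loglik = 0.0
    for theta, x in zip(thetas, data):
        # eta[i, j] = sum_{j' != j} theta[j, j'] * x[i, j'] (zero diagonal)
        eta = x @ theta
        # log(1 + exp(eta)) computed stably via logaddexp
        loglik += np.sum(x * eta - np.logaddexp(0.0, eta)) / n
    # group penalty: square root of the summed absolute effects per pair j < j'
    abs_sum = sum(np.abs(t) for t in thetas)
    iu = np.triu_indices(abs_sum.shape[0], k=1)
    penalty = lam * np.sum(np.sqrt(abs_sum[iu]))
    return loglik - penalty
```

The square root inside the penalty couples the $K$ categories: a pair $(j,j')$ is penalized as a group, so it tends to be removed from all networks at once, while individual categories can still differ within a retained group.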

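Before stating the main results, we introduce necessary notation and regularity
conditions. For each $k = 1,\ldots,K$, denote
$\theta^{(k)} = (\theta^{(k)}_{1,2},\ldots,\theta^{(k)}_{j,j'},\ldots,\theta^{(k)}_{p-1,p})^T$
as a $p(p-1)/2$-dimensional vector, recording all upper triangular
elements in $\Theta^{(k)}$. Let $\bar\theta^{(k)}$ be the true value of $\theta^{(k)}$. Let $Q^{(k)}$ be the
population Fisher information matrix of the model in criterion (A.1) (see the
precise definition below) and let $\mathcal{X}^{(k)}$ be a matrix with $p$ rows
and $p(p-1)/2$ columns, whose $(j,j')$th column is composed of zeros except
for the $j$th ($j'$th) component being $x^{(k)}_{ij'}$ ($x^{(k)}_{ij}$). In addition, we define
$U^{(k)} = E(\mathcal{X}^{(k)T}\mathcal{X}^{(k)})$. To index the zero and nonzero elements, let
$S_k = \{(j,j'): \bar\theta^{(k)}_{jj'} \neq 0, 1 \le j < j' \le p\}$ and
$S_k^c = \{(j,j'): \bar\theta^{(k)}_{jj'} = 0, 1 \le j < j' \le p\}$,
and let $S_\cap = \bigcap_{k=1}^K S_k$ and $S_\cup = \bigcup_{k=1}^K S_k$. The cardinalities of $S_k$ and $S_\cup$ are
denoted by $q_k$ and $q$, respectively. For any matrix $W$ and subsets of row
and column indices $\mathcal{U}$ and $\mathcal{V}$, let $W_{\mathcal{U},\mathcal{V}}$ be the matrix consisting of rows $\mathcal{U}$
and columns $\mathcal{V}$ in $W$. Finally, let $\Lambda_{\min}(\cdot)$ and $\Lambda_{\max}(\cdot)$ denote the smallest
and largest eigenvalue of a matrix, respectively.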

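The asymptotic properties of the joint estimation method rely on the
following regularity conditions:

(A) Nonzero elements bounds: There exist positive constants $\gamma_{\min}$ and $\gamma_{\max}$
such that:

(i) $\min_{1 \le k \le K} \min_{(j,j') \in S_k} |\bar\theta^{(k)}_{jj'}| \ge \gamma_{\min}$;

(ii) $\max_{1 \le k \le K} \max_{(j,j') \in S_\cup \setminus S_\cap} |\bar\theta^{(k)}_{jj'}| \le \gamma_{\max}$.

(B) Dependency: There exist positive constants $\tau_{\min}$ and $\tau_{\max}$ such that for
any $k = 1,\ldots,K$,

$$ \Lambda_{\min}(Q^{(k)}_{S_k,S_k}) \ge \tau_{\min} \quad\text{and}\quad \Lambda_{\max}(U^{(k)}_{S_k,S_k}) \le \tau_{\max}. \tag{A.2} $$

(C) Incoherence: There exists a constant $\tau \in (1 - \sqrt{\gamma_{\min}/(4\gamma_{\max})}, 1)$ such that
for any $k = 1,\ldots,K$,

$$ \| Q^{(k)}_{S_k^c,S_k} (Q^{(k)}_{S_k,S_k})^{-1} \|_\infty \le 1 - \tau. \tag{A.3} $$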

Condition (A) enforces a lower bound on the magnitudes of all nonzero
elements, as well as an upper bound on the magnitudes of those nonzero
elements associated with individual links. Conditions (B) and (C) bound
the amount of dependence and the influence that the nonneighbors can have
on a given node, respectively. Conditions similar to (B) and (C) were also
assumed by Meinshausen and Bühlmann (2006), Ravikumar, Wainwright
and Lafferty (2010), Peng, Zhou and Zhu (2009) and Guo et al. (2010). Our
conditions are most closely related to those of Guo et al. (2010), but here
they are extended to the heterogeneous data setting.
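
For intuition, the quantities bounded in the dependency and incoherence conditions can be computed for any given pair of matrices. The sketch below is illustrative only: the function name `dependency_and_incoherence` is hypothetical, the index set `S` is passed as a list of positions in the vectorized parameter, and the norm in the incoherence condition is taken to be the matrix infinity norm (maximum absolute row sum), which is an assumption about the notation.

```python
import numpy as np

def dependency_and_incoherence(Q, U, S):
    """Return (smallest eigenvalue of Q_SS, largest eigenvalue of U_SS,
    infinity norm of Q_{S^c S} Q_SS^{-1}) for symmetric Q and U."""
    S = np.asarray(S)
    Sc = np.setdiff1d(np.arange(Q.shape[0]), S)   # complement of S
    Q_SS = Q[np.ix_(S, S)]
    lam_min = np.linalg.eigvalsh(Q_SS).min()
    lam_max = np.linalg.eigvalsh(U[np.ix_(S, S)]).max()
    if Sc.size == 0:
        return lam_min, lam_max, 0.0
    # ord=np.inf on a 2-D array gives the maximum absolute row sum
    incoh = np.linalg.norm(Q[np.ix_(Sc, S)] @ np.linalg.inv(Q_SS), ord=np.inf)
    return lam_min, lam_max, incoh
```

A small incoherence value means that the coordinates outside `S` are only weakly explained by those inside `S`, which is what allows the true zeros to be identified.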


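Theorem 1 (Parameter estimation). Suppose all regularity conditions
hold. If the tuning parameter $\lambda = C_\lambda \sqrt{(\log p)/n}$ for some constant
$C_\lambda > (8 - 4\tau)\sqrt{\gamma_{\min}}/(1 - \tau)$ and if
$\min\{n/q^3, n_1/q_1^3, \ldots, n_K/q_K^3\} > (4/C)\log p$ for
some constant $C = \min\{\tau_{\min}^2\tau^2/[288(1 - \tau)^2], \tau_{\min}^2\tau^2/72, \tau_{\min}\tau/48\}$, then there
exists a local maximizer of the criterion (A.1), $\{\hat\theta^{(k)}\}_{k=1}^K$, such that, with
probability tending to 1,

$$ \sum_{k=1}^K \| \hat\theta^{(k)} - \bar\theta^{(k)} \|_2 \le M \sqrt{\frac{q \log p}{n}} \tag{A.4} $$

for some constant $M > (2KC_\lambda/(\tau_{\min}\sqrt{\gamma_{\min}}))(3 - 2\tau)/(2 - \tau)$.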
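
The constants in Theorem 1 are explicit functions of the condition parameters, so they can be evaluated directly. The sketch below does this under the reading of the constants given above, which is an assumption where the extracted text is ambiguous; the function names `theorem1_constants` and `tuning_parameter` are hypothetical.

```python
import math

def theorem1_constants(tau, gamma_min, tau_min, K, c_lambda=None):
    """Evaluate the thresholds appearing in Theorem 1 (illustrative only)."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie in (0, 1)")
    # strict lower bound on C_lambda
    c_lambda_lb = (8 - 4 * tau) * math.sqrt(gamma_min) / (1 - tau)
    if c_lambda is None:
        c_lambda = 1.01 * c_lambda_lb   # hypothetical choice just above the bound
    # strict lower bound on M for the chosen C_lambda
    m_lb = (2 * K * c_lambda / (tau_min * math.sqrt(gamma_min))) * (3 - 2 * tau) / (2 - tau)
    # constant C in the sample-size requirement
    c = min(tau_min**2 * tau**2 / (288 * (1 - tau) ** 2),
            tau_min**2 * tau**2 / 72,
            tau_min * tau / 48)
    return {"C_lambda_lower": c_lambda_lb, "M_lower": m_lb, "C": c}

def tuning_parameter(c_lambda, p, n):
    """lambda = C_lambda * sqrt(log(p) / n)."""
    return c_lambda * math.sqrt(math.log(p) / n)
```

Note that the lower bound on $C_\lambda$ grows without bound as $\tau$ approaches 1, so a very strong incoherence requirement forces a large tuning parameter.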


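Theorem 2 (Structure selection). Under conditions of Theorem 1, with
probability tending to 1, the maximizer $\{\hat\theta^{(k)}\}_{k=1}^K$ from Theorem 1 satisfies

$$ \hat\theta^{(k)}_{jj'} \neq 0 \quad\text{for all } (j,j') \in S_k,\ k = 1,\ldots,K; $$

$$ \hat\theta^{(k)}_{jj'} = 0 \quad\text{for all } (j,j') \in S_k^c,\ k = 1,\ldots,K. $$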


Theorems 1 and 2 establish the consistency in terms of parameter estimation
and structure selection, respectively.

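The main idea of the proofs is closely related to Guo et al. (2010), and
some strategies for dealing with the joint estimation are borrowed from Guo
et al. (2011). We introduce notation first. For the $k$th category, we define
the log-likelihood as

$$
l(\theta^{(k)}) = \frac{1}{n} \sum_{i=1}^{n_k} \sum_{j=1}^p
\biggl[ x_{ij}^{(k)} \sum_{j' \neq j} \theta_{jj'}^{(k)} x_{ij'}^{(k)}
- \log\biggl\{ 1 + \exp\biggl( \sum_{j' \neq j} \theta_{jj'}^{(k)} x_{ij'}^{(k)} \biggr) \biggr\} \biggr],
$$

whose first derivative and second derivative are denoted by $\nabla l(\theta^{(k)})$ and
$\nabla^2 l(\theta^{(k)})$, respectively. Note that $\nabla l(\theta^{(k)})$ is a $p(p-1)/2$-dimensional vector
and $\nabla^2 l(\theta^{(k)})$ is a $p(p-1)/2 \times p(p-1)/2$ matrix. Then, the population Fisher
information matrix of the model in (A.1) at $\bar\theta^{(k)}$ can be defined as
$Q^{(k)} = -E[\nabla^2 l(\bar\theta^{(k)})]$, and its sample counterpart is
$\hat Q^{(k)} = -\nabla^2 l(\bar\theta^{(k)})$. We also
write $\hat U^{(k)}$ for the sample counterpart of $U^{(k)}$. Let
$\tilde\theta^{(k)} = (\tilde\theta^{(k)}_{1,2},\ldots,\tilde\theta^{(k)}_{j,j'},\ldots,\tilde\theta^{(k)}_{p-1,p})^T$
be the same as $\theta^{(k)}$ except that all elements
in $S_k^c$ are set to zero and write $\delta^{(k)} = \tilde\theta^{(k)} - \bar\theta^{(k)}$ and
$\hat\delta^{(k)} = \hat\theta^{(k)} - \bar\theta^{(k)}$. Finally,
let $\mathcal{W}$ be a subset of the index set $\{1,2,\ldots,p(p-1)/2\}$. For a $p(p-1)/2$-dimensional
vector $\beta$, we define $\beta_{\mathcal{W}}$ as the vector consisting of the elements
of $\beta$ associated with $\mathcal{W}$.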

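Next, we introduce a variant of criterion (A.1) by restricting all true zeros
in $\{\bar\theta^{(k)}\}_{k=1}^K$ to be estimated as zero. Specifically, the restricted criterion is
formulated as follows:

$$
\max_{\{\tilde\theta^{(k)}\}_{k=1}^K} \sum_{k=1}^K l(\tilde\theta^{(k)})
- \lambda \sum_{j < j'} \biggl( \sum_{k=1}^K |\tilde\theta^{(k)}_{jj'}| \biggr)^{1/2},
\tag{A.5}
$$

where $\tilde\theta^{(k)}$ denotes $\theta^{(k)}$ with all elements in $S_k^c$ set to zero,
and its maximizer is denoted by $\{\hat\theta^{(k)}\}_{k=1}^K$. In addition, we consider the
sample versions of regularity conditions (B) and (C):

(B') Sample dependency: There exist positive constants $\tau_{\min}$ and $\tau_{\max}$
such that for any $k = 1,\ldots,K$,

$$ \Lambda_{\min}(\hat Q^{(k)}_{S_k,S_k}) \ge \tau_{\min} \quad\text{and}\quad \Lambda_{\max}(\hat U^{(k)}_{S_k,S_k}) \le \tau_{\max}. \tag{A.6} $$

(C') Sample incoherence: There exists a constant $\tau \in (1 - \sqrt{\gamma_{\min}/(4\gamma_{\max})}, 1)$
such that for any $k = 1,\ldots,K$,

$$ \| \hat Q^{(k)}_{S_k^c,S_k} (\hat Q^{(k)}_{S_k,S_k})^{-1} \|_\infty \le 1 - \tau/2. \tag{A.7} $$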

For convenience of the readers, the proof of our main result is divided into 
two parts: Part I presents the main idea of the proof by listing the important 
propositions and the proofs of Theorems 1 and 2, whereas part II contains 
additional technical details and proofs of propositions in part I. 


Part I: Propositions and proof of Theorems 1 and 2. The proof consists
of the following steps. Proposition 2 shows that, under sample regularity
conditions (B') and (C'), the conclusions of Theorems 1 and 2 hold for
the local maximizer of the restricted problem (A.5). Next, Proposition 3
proves that the population regularity conditions (B) and (C) give rise to their
sample counterparts (B') and (C') with probability tending to one, hence,
the conclusions of Proposition 2 also hold with the population regularity
conditions. Last, we show that the local maximizer of (A.5) is also a local
maximizer of the original model (A.1). This is established via Proposition 4,
which sets out the Karush-Kuhn-Tucker (KKT) conditions for the local
maximizer of criterion (A.1), and Proposition 5, which shows that, with
probability tending to one, the local maximizer of (A.5) satisfies these KKT
conditions.

Proposition 2. Suppose condition (A) and the sample conditions (B')
and (C') hold. If the tuning parameter $\lambda = C_\lambda \sqrt{(\log p)/n}$ for some constant
$C_\lambda > (8 - 4\tau)\sqrt{\gamma_{\min}}/(1 - \tau)$ and $q\sqrt{(\log p)/n} = o(1)$, then with probability
tending to one, there exists a local maximizer of the restricted criterion,
$\{\hat\theta^{(k)}\}_{k=1}^K$, satisfying:

(i) $\sum_{k=1}^K \| \hat\theta^{(k)} - \bar\theta^{(k)} \|_2 \le M \sqrt{q(\log p)/n}$ for some constant
$M > (2KC_\lambda/(\tau_{\min}\sqrt{\gamma_{\min}}))[(3 - 2\tau)/(2 - \tau)]$;

(ii) For each $k = 1,\ldots,K$, $\hat\theta^{(k)}_{jj'} \neq 0$ for all $(j,j') \in S_k$ and
$\hat\theta^{(k)}_{jj'} = 0$ for all $(j,j') \in S_k^c$.

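Proposition 3. Suppose the regularity conditions (B) and (C) hold,
then for any $\varepsilon > 0$, the following inequalities hold with probability tending to
one for all $k = 1,\ldots,K$:

(i) $P\{\Lambda_{\min}(\hat Q^{(k)}_{S_k,S_k}) \le \tau_{\min} - \varepsilon\} \le 2\exp\{-(\varepsilon^2/2)(n_k/q_k^2) + 2\log q_k\}$;

(ii) $P\{\Lambda_{\max}(\hat U^{(k)}_{S_k,S_k}) \ge \tau_{\max} + \varepsilon\} \le 2\exp\{-(\varepsilon^2/2)(n_k/q_k^2) + 2\log q_k\}$;

(iii) $P[\| \hat Q^{(k)}_{S_k^c,S_k} (\hat Q^{(k)}_{S_k,S_k})^{-1} \|_\infty \ge 1 - \tau/2] \le 12\exp(-Cn_k/q_k^3 + 4\log p)$, for
some constant $C = \min\{\tau_{\min}^2\tau^2/[288(1 - \tau)^2], \tau_{\min}^2\tau^2/72, \tau_{\min}\tau/48\}$.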


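Proposition 4. $\{\hat\theta^{(k)}\}_{k=1}^K$ is a local maximizer of problem (A.1) if and
only if the following conditions hold for all $k = 1,\ldots,K$:

$$
\begin{aligned}
\nabla_{jj'} l(\hat\theta^{(k)}) &= \lambda\,\mathrm{sgn}(\hat\theta^{(k)}_{jj'}) \bigg/ \biggl( \sum_{k=1}^K |\hat\theta^{(k)}_{jj'}| \biggr)^{1/2}
&&\text{if } \hat\theta^{(k)}_{jj'} \neq 0; \\
|\nabla_{jj'} l(\hat\theta^{(k)})| &< \lambda \bigg/ \biggl( \sum_{k=1}^K |\hat\theta^{(k)}_{jj'}| \biggr)^{1/2}
&&\text{if } \hat\theta^{(k)}_{jj'} = 0.
\end{aligned}
\tag{A.8}
$$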
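
The KKT conditions lend themselves to a direct numerical check of a candidate solution. The sketch below is a hedged illustration, not part of the original proof: the function name `kkt_satisfied`, the tolerance and the matrix layout (gradients and estimates stored as $p \times p$ arrays, with only the upper triangle inspected) are assumptions, and the conditions are taken in the form stated in Proposition 4 as read here.

```python
import numpy as np

def kkt_satisfied(grads, thetas, lam, tol=1e-8):
    """Check stationarity conditions of the form in Proposition 4.

    grads:  list of K (p, p) arrays; grads[k][j, j'] is the derivative of the
            k-th log-likelihood with respect to theta_{jj'}^{(k)} (assumed given).
    thetas: list of K (p, p) arrays of estimates.
    """
    abs_sum = sum(np.abs(t) for t in thetas)
    with np.errstate(divide="ignore"):
        bound = lam / np.sqrt(abs_sum)   # +inf where the pair is zero in every category
    iu = np.triu_indices(abs_sum.shape[0], k=1)
    for g, t in zip(grads, thetas):
        g_u, t_u, b_u = g[iu], t[iu], bound[iu]
        nz = t_u != 0
        # nonzero estimates: gradient equals the signed penalty derivative
        if not np.allclose(g_u[nz], np.sign(t_u[nz]) * b_u[nz], atol=tol):
            return False
        # zero estimates: gradient strictly inside the bound
        if not np.all(np.abs(g_u[~nz]) < b_u[~nz]):
            return False
    return True
```

When a pair is zero in every category the bound is infinite, so the second condition is automatically satisfied there; this is the mechanism by which the group penalty removes an edge from all networks simultaneously.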


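Proposition 5. Under all conditions of Proposition 2, with probability
tending to one, we have, for each $k = 1,\ldots,K$,

$$
\begin{aligned}
\nabla_{jj'} l(\hat\theta^{(k)}) &= \lambda\,\mathrm{sgn}(\hat\theta^{(k)}_{jj'}) \bigg/ \biggl( \sum_{k=1}^K |\hat\theta^{(k)}_{jj'}| \biggr)^{1/2}
&&\text{for all } (j,j') \in S_k; \\
|\nabla_{jj'} l(\hat\theta^{(k)})| &< \lambda \bigg/ \biggl( \sum_{k=1}^K |\hat\theta^{(k)}_{jj'}| \biggr)^{1/2}
&&\text{for all } (j,j') \in S_k^c.
\end{aligned}
\tag{A.9}
$$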


Proof of Theorems 1 and 2. The condition
$\min\{n/q^3, n_1/q_1^3, \ldots, n_K/q_K^3\} > (4/C)\log p$ implies that, for each
$k = 1,\ldots,K$, we have $-Cn_k/q_k^3 + 4\log p < 0$ and
$-(\varepsilon^2/2)(n_k/q_k^2) + 2\log q_k < 0$ when $q_k$ is large enough. This
condition also implies $q\sqrt{(\log p)/n} = o(1)$. In addition, by Proposition 3, the
sample conditions (B') and (C') hold with probability tending to one when
regularity conditions (B) and (C) hold. Therefore, by Proposition 2, with
probability tending to one, the solution of the restricted problem $\{\hat\theta^{(k)}\}_{k=1}^K$
satisfies both parameter estimation consistency and structure selection consistency.
On the other hand, by Proposition 5, with probability tending to
one, $\{\hat\theta^{(k)}\}_{k=1}^K$ also satisfies the KKT conditions in Proposition 4, thus, it is
a local maximizer of criterion (A.1). This proves Theorems 1 and 2. □

Part II: Proofs of propositions. Before proving the propositions, we state
a few lemmas which will be used in the proofs. These lemmas are variants
of Lemmas 1, 2 and 5 in Guo et al. (2010), adapted to the setting of the
heterogeneous model and, thus, the proofs are omitted here. Likewise, the
proof of Proposition 3 is very similar to the proof of Propositions 3 and 4
in Guo et al. (2010) and is omitted.


Lemma 1. For each $k = 1,\ldots,K$, with probability tending to 1, we have
$\| \nabla l(\bar\theta^{(k)}) \|_\infty \le C_\nabla \sqrt{(\log p)/n}$ for some constant $C_\nabla > 4$.

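Lemma 2. If the sample dependency condition (B') holds and
$q\sqrt{(\log p)/n} = o(1)$, then for any $\alpha_k \in [0,1]$, $k = 1,\ldots,K$, the following
inequality holds with probability tending to 1:

$$
-\sum_{k=1}^K \delta^{(k)T}_{S_k} [\nabla^2 l(\bar\theta^{(k)} + \alpha_k\delta^{(k)})]_{S_k,S_k} \delta^{(k)}_{S_k}
\ge \frac{\tau_{\min}}{2} \sum_{k=1}^K \| \delta^{(k)} \|_2^2.
\tag{A.10}
$$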


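Lemma 3. Suppose the sample dependency condition (B') holds. For any
$\alpha_k \in [0,1]$, $k = 1,\ldots,K$, the following inequality holds with probability tending
to one:

$$ \| [\nabla^2 l(\bar\theta^{(k)} + \alpha_k\delta^{(k)}) - \nabla^2 l(\bar\theta^{(k)})]\delta^{(k)} \|_\infty \le \tau_{\max} \| \delta^{(k)} \|_2^2. \tag{A.11} $$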


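Proof of Proposition 2. The main idea of the proof was first introduced
in this context in Rothman et al. (2008) and has since been used by
many authors. Define

$$
G(\{\delta^{(k)}\}_{k=1}^K) = -\sum_{k=1}^K [l(\bar\theta^{(k)} + \delta^{(k)}) - l(\bar\theta^{(k)})]
+ \lambda \sum_{(j,j') \in S_\cup} \biggl\{ \biggl( \sum_{k=1}^K |\bar\theta^{(k)}_{jj'} + \delta^{(k)}_{jj'}| \biggr)^{1/2}
- \biggl( \sum_{k=1}^K |\bar\theta^{(k)}_{jj'}| \biggr)^{1/2} \biggr\}.
\tag{A.12}
$$

It can be seen from (A.5) that $\{\hat\delta^{(k)}\}_{k=1}^K$ minimizes $G(\{\delta^{(k)}\}_{k=1}^K)$ and
$G(\{0\}_{k=1}^K) = 0$. Thus, we must have $G(\{\hat\delta^{(k)}\}_{k=1}^K) \le 0$. If we take a closed
set $\mathcal{A}$ which contains $\{0\}_{k=1}^K$ and show that $G$ is strictly positive everywhere
on the boundary $\partial\mathcal{A}$, then it implies that $G$ has a local minimum
inside $\mathcal{A}$, since $G$ is continuous and $G(\{0\}_{k=1}^K) = 0$. Specifically, we define
$\mathcal{A} = \{\{\delta^{(k)}\}_{k=1}^K: \sum_{k=1}^K \| \delta^{(k)} \|_2 \le Ma_n\}$, with boundary
$\partial\mathcal{A} = \{\{\delta^{(k)}\}_{k=1}^K: \sum_{k=1}^K \| \delta^{(k)} \|_2 = Ma_n\}$, for some constant
$M > (2KC_\lambda/(\tau_{\min}\sqrt{\gamma_{\min}}))[(3 - 2\tau)/(2 - \tau)]$ and $a_n = \sqrt{q(\log p)/n}$.
For any $\{\delta^{(k)}\}_{k=1}^K \in \partial\mathcal{A}$, the Taylor series
expansion gives $G(\{\delta^{(k)}\}_{k=1}^K) = I_1 + I_2 + I_3$, where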


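$$
\begin{aligned}
I_1 &= -\sum_{k=1}^K [\nabla l(\bar\theta^{(k)})]^T_{S_k} \delta^{(k)}_{S_k}, \\
I_2 &= -\sum_{k=1}^K \delta^{(k)T}_{S_k} [\nabla^2 l(\bar\theta^{(k)} + \alpha_k\delta^{(k)})]_{S_k,S_k} \delta^{(k)}_{S_k}
\quad\text{for some } \alpha_k \in [0,1], \\
I_3 &= \lambda \sum_{(j,j') \in S_\cup} \biggl\{ \biggl( \sum_{k=1}^K |\bar\theta^{(k)}_{jj'} + \delta^{(k)}_{jj'}| \biggr)^{1/2}
- \biggl( \sum_{k=1}^K |\bar\theta^{(k)}_{jj'}| \biggr)^{1/2} \biggr\}.
\end{aligned}
\tag{A.13}
$$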

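Since $C_\lambda > (8 - 4\tau)\sqrt{\gamma_{\min}}/(1 - \tau)$, we have
$[(1 - \tau)/(2 - \tau)]C_\lambda/\sqrt{\gamma_{\min}} > 4$. By Lemma 1,

$$
|I_1| \le \sum_{k=1}^K \| [\nabla l(\bar\theta^{(k)})]_{S_k} \|_\infty \| \delta^{(k)}_{S_k} \|_1
\le [(1 - \tau)C_\lambda M\gamma_{\min}^{-1/2}/(2 - \tau)](q\log p)/n.
\tag{A.14}
$$

In addition, by the condition $q\sqrt{(\log p)/n} = o(1)$, Lemma 2 holds and, thus,

$$
I_2 \ge (\tau_{\min}/2) \sum_{k=1}^K \| \delta^{(k)} \|_2^2 \ge [\tau_{\min}/(2K)]M^2 q(\log p)/n.
\tag{A.15}
$$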

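Finally, by the triangle inequality and regularity condition (A),

$$
\begin{aligned}
|I_3| &\le \lambda \sum_{(j,j') \in S_\cup}
\frac{\bigl| \sum_{k=1}^K |\bar\theta^{(k)}_{jj'} + \delta^{(k)}_{jj'}| - \sum_{k=1}^K |\bar\theta^{(k)}_{jj'}| \bigr|}
{\bigl( \sum_{k=1}^K |\bar\theta^{(k)}_{jj'} + \delta^{(k)}_{jj'}| \bigr)^{1/2} + \bigl( \sum_{k=1}^K |\bar\theta^{(k)}_{jj'}| \bigr)^{1/2}} \\
&\le (\lambda\gamma_{\min}^{-1/2}) \sum_{k=1}^K \sum_{(j,j') \in S_\cup} |\delta^{(k)}_{jj'}|
\le (\lambda\sqrt{q}\,\gamma_{\min}^{-1/2}) \sum_{k=1}^K \| \delta^{(k)} \|_2 \\
&\le (MC_\lambda\gamma_{\min}^{-1/2})\{q(\log p)/n\}.
\end{aligned}
\tag{A.16}
$$

Then we have

$$
G(\{\delta^{(k)}\}_{k=1}^K) \ge M^2 \frac{q\log p}{n}
\biggl\{ \frac{\tau_{\min}}{2K} - \frac{(1 - \tau)C_\lambda}{M\sqrt{\gamma_{\min}}(2 - \tau)} - \frac{C_\lambda}{M\sqrt{\gamma_{\min}}} \biggr\} > 0.
\tag{A.17}
$$

The last inequality uses the condition
$M > (2KC_\lambda/(\tau_{\min}\sqrt{\gamma_{\min}}))[(3 - 2\tau)/(2 - \tau)]$.
Therefore, with probability tending to 1, we have
$\sum_{k=1}^K \| \hat\delta^{(k)} \|_2 \le M\sqrt{q(\log p)/n}$, and consequently claim (i) in Proposition 2 holds.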
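
The step from the condition on $M$ to the positivity of the lower bound on $G$ rests on a short identity. It is spelled out below under the reading of the constants used in this appendix, which is an assumption where the extracted formulas are ambiguous:

```latex
\frac{1-\tau}{2-\tau} + 1 = \frac{3-2\tau}{2-\tau},
\qquad\text{so that}\qquad
\frac{\tau_{\min}}{2K}
 - \frac{C_\lambda}{M\sqrt{\gamma_{\min}}}\cdot\frac{3-2\tau}{2-\tau} > 0
\iff
M > \frac{2KC_\lambda}{\tau_{\min}\sqrt{\gamma_{\min}}}\cdot\frac{3-2\tau}{2-\tau}.
```

The first term comes from the quadratic bound on $I_2$, and the two subtracted terms come from the bounds on $|I_1|$ and $|I_3|$, respectively.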

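On the other hand, by the definition of $\hat\theta^{(k)}$, we have $\hat\theta^{(k)}_{jj'} = 0$ for all
$(j,j') \in S_k^c$. By regularity condition (A) and Proposition 2(i), for any $(j,j') \in S_k$,
$k = 1,\ldots,K$, we have
$|\hat\theta^{(k)}_{jj'}| \ge |\bar\theta^{(k)}_{jj'}| - |\hat\theta^{(k)}_{jj'} - \bar\theta^{(k)}_{jj'}| \ge \gamma_{\min}/2 > 0$, when $n$
is large enough. □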


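Proof of Proposition 5. By Proposition 2, with probability tending
to one, we have $\hat\theta^{(k)}_{jj'} \neq 0$ for all $(j,j') \in S_k$. Since $\{\hat\theta^{(k)}\}_{k=1}^K$ is a local
maximizer of the restricted problem (A.5), with probability tending to one,
$\nabla_{jj'} l(\hat\theta^{(k)}) = \lambda\,\mathrm{sgn}(\hat\theta^{(k)}_{jj'}) / (\sum_{k=1}^K |\hat\theta^{(k)}_{jj'}|)^{1/2}$ for all $(j,j') \in S_k$.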

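To show the second claim, we apply the mean value theorem and write
$\nabla l(\hat\theta^{(k)}) = \nabla l(\bar\theta^{(k)}) + r^{(k)} - \hat Q^{(k)}\hat\delta^{(k)}$, where
$r^{(k)} = [\nabla^2 l(\bar\theta^{(k)} + \alpha_k\hat\delta^{(k)}) - \nabla^2 l(\bar\theta^{(k)})]\hat\delta^{(k)}$
for some $\alpha_k \in [0,1]$. After some simplifications, we have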




(fc), 




(A.18) 
and, thus,

On the other hand, $\lambda / (\sum_{k=1}^K |\hat\theta^{(k)}_{jj'}|)^{1/2} = +\infty$ when $(j,j') \in S_\cup^c$. Otherwise,
if $(j,j') \in S_\cup \setminus S_k$, then



/ K \ 1/2 f K 


1/2 


> A/V7 max ^ (2 - 2r)A/^ 


max 


mm* 


Thus, for any $(j,j') \in S_k^c$ ($k = 1,\ldots,K$), we have



(A.20)